The question provides a data set of variables associated with house prices in Saratoga. We have data for more than 1,700 houses, including their prices, land value, and other attributes such as the number of bedrooms and bathrooms, living area, and lot size. The task is to develop models that predict the market price of a house so that tax authorities can tax properties at their market value. We use the given sample to construct two different models for this question.
The first part of the question asks us to hand-build a linear regression model with price as the dependent variable and the other variables as candidate independent variables. We start by assessing the medium model provided in the professor's script and check its out-of-sample RMSE by running it on 1,000 different train/test splits.
Professor's medium model
lm_medium = lm(price ~ lotSize + age + livingArea + pctCollege + bedrooms +
               fireplaces + bathrooms + rooms + heating + fuel + centralAir)
## [1] "RMSE for Medium"
## [1] 66543.45
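The repeated train/test evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the report: the helper names `rmse` and `avg_rmse_lm`, the 80/20 split, and the number of splits are assumptions, and the built-in `mtcars` data stands in for the Saratoga data so that the example runs on its own.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Average out-of-sample RMSE of a linear model over repeated random splits.
# n_splits and train_frac are illustrative defaults (assumptions).
avg_rmse_lm <- function(formula, data, n_splits = 100, train_frac = 0.8) {
  response <- all.vars(formula)[1]  # name of the dependent variable
  errors <- replicate(n_splits, {
    train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
    fit <- lm(formula, data = data[train_idx, ])
    test <- data[-train_idx, ]
    rmse(test[[response]], predict(fit, newdata = test))
  })
  mean(errors)
}

set.seed(1)  # makes the random splits reproducible
avg_rmse_lm(mpg ~ wt + hp, data = mtcars, n_splits = 50)
```

Averaging over many splits matters because a single random split can make one model look better or worse purely by chance.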
We create new variables such as extrarooms = rooms - bedrooms. We also include two additional variables, landValue and newConstruction, which improve our RMSE. However, composite variables such as living area per lot size, bathrooms per bedroom, and building value (property price minus land value) did not improve the out-of-sample RMSE of the model.
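As a small illustration of how such derived variables can be built, the sketch below uses a tiny made-up data frame. The values are hypothetical and the column name `bath_per_bed` is an assumption; only `extrarooms`, `rooms`, `bedrooms`, and `bathrooms` come from the report.

```r
# Tiny hypothetical data frame; the values are made up for illustration only.
houses <- data.frame(
  rooms     = c(5, 7, 9),
  bedrooms  = c(2, 3, 4),
  bathrooms = c(1, 1.5, 2.5)
)

# Add derived columns in one step.
houses <- transform(
  houses,
  extrarooms   = rooms - bedrooms,      # rooms that are not bedrooms
  bath_per_bed = bathrooms / bedrooms   # composite ratio (assumed name)
)
houses
```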
We used the step() function to narrow down the variables and interactions that might give us lower variance, but the lowest-AIC model did not achieve a better out-of-sample RMSE across multiple iterations. Manually adding more interaction and polynomial terms one at a time also did not help.
Looking at the coefficients, we can say that lot size, number of bedrooms, number of bathrooms, living area, central air, heating, fuel, and land value are the most important variables in explaining house prices in the sample. Some interaction terms are also significant in the regression model, but they contribute little to out-of-sample RMSE and in most cases increase the out-of-sample prediction error.
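For readers unfamiliar with stepwise selection, a minimal sketch of `step()` is shown here. It assumes a search over main effects and pairwise interactions, and it uses the built-in `mtcars` data in place of the Saratoga data; the exact scope used in the report is not stated.

```r
# Start from a main-effects model.
base_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Let step() add or drop terms, considering all pairwise interactions.
step_fit <- step(
  base_fit,
  scope = list(lower = ~ 1, upper = ~ (wt + hp + qsec)^2),
  direction = "both",
  trace = 0  # suppress the step-by-step log
)

formula(step_fit)       # the selected model
extractAIC(step_fit)    # equivalent degrees of freedom and AIC used by step()
```

Because AIC is computed on the data used for fitting, the selected model is not guaranteed to have the lowest out-of-sample RMSE, which matches what we observed.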
We therefore chose the following as our final linear regression model for house prices.
Best Linear Regression Fit
lm(price ~ landValue + lotSize + livingArea + bedrooms + bathrooms + extrarooms +
   centralAir + heating + age + newConstruction + fireplaces + fuel + pctCollege)
## [1] "RMSE for Best Linear Model"
## [1] 59657.24
In the third part, the question asks us to fit a K-nearest-neighbor (KNN) model. We select the same variables as in our linear model and scale them before fitting the KNN model. We ran 300 loops for each K from 1 to 300. The average RMSE bottoms out for K roughly between 100 and 150. However, the exact value of K with the minimum average RMSE changes each time we rerun the 500 training/test splits for each K. We selected K = 135 based on our 500 training/test splits. It gave an RMSE of 80,616.88.
## [1] 80616.88
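The procedure of scaling the features, fitting KNN, and averaging RMSE over random splits for each K can be sketched as below. This is an illustrative base-R version under stated assumptions: `knn_predict` and `knn_rmse_one_split` are hypothetical helpers written for this example (in practice a package function would normally be used), the K grid and number of splits are much smaller than in the report, and `mtcars` stands in for the Saratoga data.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Predict each test row as the mean response of its k nearest training rows
# (Euclidean distance on numeric features that are already scaled).
knn_predict <- function(x_train, y_train, x_test, k) {
  apply(x_test, 1, function(row) {
    d <- sqrt(colSums((t(x_train) - row)^2))
    mean(y_train[order(d)[seq_len(k)]])
  })
}

# One random split: scale with training means/sds, then score on the test rows.
knn_rmse_one_split <- function(x, y, k, train_frac = 0.8) {
  train_idx <- sample(nrow(x), size = floor(train_frac * nrow(x)))
  x_train <- scale(x[train_idx, , drop = FALSE])
  x_test <- scale(x[-train_idx, , drop = FALSE],
                  center = attr(x_train, "scaled:center"),
                  scale = attr(x_train, "scaled:scale"))
  rmse(y[-train_idx], knn_predict(x_train, y[train_idx], x_test, k))
}

set.seed(1)
x <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg
k_grid <- c(1, 3, 5, 10)  # illustrative grid (assumption)
avg_rmse <- sapply(k_grid, function(k) {
  mean(replicate(50, knn_rmse_one_split(x, y, k)))
})
k_grid[which.min(avg_rmse)]  # K with the lowest average RMSE in this run
```

Scaling the test rows with the training means and standard deviations keeps information from the test set out of the fitting step. Rerunning with a different seed can change the selected K, which is the instability noted above.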
We have two models for predicting house prices in Saratoga: the linear regression model and the KNN model. Both have their strengths and weaknesses. The main metric for comparing them is out-of-sample prediction error, measured by RMSE. Running both models on more than 500 different train/test splits, we find that the linear regression model has the lower RMSE, which means that on average it predicts prices more accurately than the KNN model.
Linear regression model mean RMSE = 59,536.05
KNN model mean RMSE at K = 135 = 80,616.88
We also fit both models on the same training set and predict on the same test set, so that we can compare their RMSE and fit on identical data points.
## [1] "LM RMSE"
## [1] 63574.17
## [1] "KNN RMSE"
## [1] 77759.13
Looking at the actual vs. predicted plot, we see that the KNN model's predictions are more spread out than the LM's. The LM's predictions are evenly distributed around the center line, whereas the KNN model's predictions tend to fall on the lower side of the line, indicating that on average it predicts prices below the actual ones. Here the LM gives better predictions, with a lower RMSE.
We can also see that, for both models, the predictions for higher-priced houses are far from the actual prices. This means that neither model performs well at extreme values. We therefore check the performance of both models on prices in the lower and higher percentiles to see how their RMSE behaves at the fringes.
We run 100 random train/test splits of the sample, fit both models on each split, and compute the RMSE of each model at different percentiles of price. The table below shows that the KNN model has a higher error in the higher percentiles. That is, houses with higher prices are predicted less accurately than houses with average prices. The RMSE of the LM is also high there, but lower than that of the KNN model. In the lower percentiles, the LM has a slightly higher RMSE than KNN, but the difference is not as stark as in the higher percentiles. We therefore still prefer the LM over KNN, since it also performs better overall at extreme values.
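One way to compute RMSE within percentile bands of the actual price is sketched below. The helper `rmse_by_percentile` and the band cut-offs are assumptions made for this example. For brevity it uses leave-one-out predictions on the built-in `mtcars` data in place of the 100 random splits on the Saratoga data described above.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# RMSE within bands of the actual response, with bands defined by its quantiles.
# The default cut-offs are illustrative (assumption).
rmse_by_percentile <- function(actual, predicted,
                               probs = c(0, 0.1, 0.25, 0.75, 0.9, 1)) {
  breaks <- quantile(actual, probs = probs)
  band <- cut(actual, breaks = breaks, include.lowest = TRUE)
  tapply(seq_along(actual), band, function(i) rmse(actual[i], predicted[i]))
}

# Out-of-sample predictions via leave-one-out (swapped in for random splits).
loo_pred <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  predict(fit, newdata = mtcars[i, ])
})

rmse_by_percentile(mtcars$mpg, loo_pred)
```

Running the same function on the predictions from each model gives a band-by-band comparison of where each model's errors are concentrated.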